In [ ]:
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import seaborn as sns
from mpl_toolkits.mplot3d import Axes3D
from codefiles.datagen import random_xy, x_plus_noise, data_3d
from codefiles.dataplot import plot_principal_components, plot_3d, plot_2d
# %matplotlib inline
%matplotlib notebook
In [ ]:
data_3D_corr = data_3d(num_points=1000, randomness=0.01, theta_x=30, theta_z=60)
plot_3d(data_3D_corr)
In [ ]:
pca_3d = PCA()  # keep all 3 components (no dimensionality reduction yet)
pca_3d.fit(data_3D_corr)
The fraction of the total variance explained by each principal component:
In [ ]:
print('P.C. #0 explained variance ratio: %0.7f' % pca_3d.explained_variance_ratio_[0])
print('P.C. #1 explained variance ratio: %0.7f' % pca_3d.explained_variance_ratio_[1])
print('P.C. #2 explained variance ratio: %0.7f' % pca_3d.explained_variance_ratio_[2])
Check the order-of-magnitude differences among the three. At least one of them is not very informative... Let's transform our dataset with the PCA and see what it looks like.
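To make the gap concrete, here is a quick sketch (reusing the `pca_3d` fitted above) that prints the cumulative explained variance ratio:
In [ ]:
# Cumulative explained variance ratio: the fraction of the total variance
# captured by the first k principal components together.
print(np.cumsum(pca_3d.explained_variance_ratio_))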
In [ ]:
data_flat = pca_3d.transform(data_3D_corr)
data_flat
We keep the 3D dimensionality with the three principal components (P.C. 0, P.C. 1 and P.C. 2), but the third column (P.C. 2) now has values of much smaller magnitude than the other two (P.C. 0 and P.C. 1).
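As a sanity check (a small sketch on the array returned by `transform`), the per-column standard deviation of the transformed data should mirror the explained-variance ranking:
In [ ]:
# Standard deviation along each principal component; the value for
# P.C. 2 should be far smaller than those for P.C. 0 and P.C. 1.
print(np.std(data_flat, axis=0))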
In [ ]:
plot_principal_components(data_flat, x1=0, x2=1)
plt.axis('equal')
plt.title('The most informative (more spread/more variance) view of the data')
plt.show()
From the plot above, P.C. 0 is more informative than P.C. 1. Next, let's compare P.C. 1 with P.C. 2 (the one we suspect is not that informative).
In [ ]:
plot_principal_components(data_flat, x1=1, x2=2)
plt.title('A less informative (less spread/less variance) view of the data')
plt.xlim([-25,25])
plt.ylim([-25,25])
plt.show()
So, if we were performing dimensionality reduction, we could keep just the first two components without a significant loss of information.
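As a sketch of what that reduction would look like with scikit-learn (refitting on the same dataset):
In [ ]:
# Keep only the two most informative components.
pca_2d = PCA(n_components=2)
data_reduced = pca_2d.fit_transform(data_3D_corr)
print(data_reduced.shape)                      # (1000, 2)
print(pca_2d.explained_variance_ratio_.sum())  # should be very close to 1.0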
In [ ]:
# Generate the non-rotated dataset
data_3D_corr_0 = data_3d(num_points=1000, randomness=0.01, theta_x=0, theta_z=0)
# Plot it (Axes3D was already imported at the top of the notebook)
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
xs = data_3D_corr_0['x'].values
ys = data_3D_corr_0['y'].values
zs = data_3D_corr_0['z'].values
ax.scatter(xs, ys, zs)
ax.set_xlim3d(-20, 20)
ax.set_ylim3d(-20, 20)
ax.set_zlim3d(-20, 20)
plt.title('Data not rotated')
plt.show()
2) Then we rotated the data around the $z$ axis and the $x$ axis, by 60 and 30 degrees respectively (in that order!), and we ended up with the first plot in this notebook (check it above).
In [ ]:
from codefiles.datagen import _rot_mat_z, _rot_mat_x
# Rotation angles in degrees
theta_x = 30
theta_z = 60
# The rotation matrix is given by: Rot_x * Rot_z
rotation_matrix = np.dot(_rot_mat_x(theta=np.radians(theta_x)), _rot_mat_z(theta=np.radians(theta_z)))
rotation_matrix
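For reference, assuming `_rot_mat_x` and `_rot_mat_z` implement the standard right-handed rotation matrices (if the helpers use the opposite sign convention, the off-diagonal signs are flipped):

$$R_x(\theta) = \begin{pmatrix} 1 & 0 & 0 \\ 0 & \cos\theta & -\sin\theta \\ 0 & \sin\theta & \cos\theta \end{pmatrix}, \qquad R_z(\theta) = \begin{pmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{pmatrix}$$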
Now, check the transformation the PCA applied to our dataset. We need to transpose `components_` because it encodes the inverse transformation (and, for a rotation matrix, the inverse is just the transpose):
In [ ]:
pca_3d.components_.transpose()
Look familiar? :) The PCA just found what we did! (Keep in mind that the sign of each principal component is arbitrary, so some columns may come out flipped.) If we had not rotated our dataset, the recovered matrix would have been (close to) a 3x3 identity matrix.
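As a final sanity check, here is a sketch that compares the two matrices numerically, assuming the component ordering matches the variance ordering of the original axes. Since the sign of each principal component is arbitrary, we align the signs column by column before comparing:
In [ ]:
recovered = pca_3d.components_.transpose()
# Sign of the dot product between matching columns of the two matrices
signs = np.sign(np.sum(rotation_matrix * recovered, axis=0))
# True if the matrices agree up to per-column sign flips and sampling noise
print(np.allclose(rotation_matrix, recovered * signs, atol=0.05))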